In this project I have been given data from the Lifelines project. Lifelines gave people living in the provinces of Drenthe, Groningen and Friesland a questionnaire that collected data in numerous categories, such as weight, height, finances, diseases, lifestyle choices and social status (finances, degree, etc.). These categories have been measured in different ways, and some are scored according to external rule sets. For example, the data on sleep quality (section: lifestyle and environment) was reviewed by researchers from the Erasmus MC, who developed a PSQI derivative for the Lifelines cohort. The variable included in the public health dataset is an indicator of people who experience 'poor sleep quality'. It is scored as either 0, meaning bad sleep quality, or 1, meaning good sleep quality.
The task at hand is to create a clean data dashboard that displays the categories that are important for this research, which are the lifestyle categories: METABOLIC_DISORDER, BURNOUT, DEPRESSION, MWK_VAL / SPORTS_T1, SLEEP_QUALITY, SMOKING, SUMOFALCOHOL, SUMOFKCAL, DBP (diastolic blood pressure in mmHg at baseline), HBF (pulse rate in beats per minute at baseline), FINANCE, BMI, WEIGHT, HEIGHT. These lifestyle factors can be viewed in different ways: the data dashboard will have a standard bar plot and an interactive bar plot, and both can be filtered through a sidebar. The sidebar will have filters for which provinces are shown, which genders are shown and which age range is shown (a filter on financial situation will be added later). There will be two plot types, a hexbin plot and a bar plot; a dropdown box switches between them.
The purpose of this data dashboard is to let the user see which lifestyles are common in the different provinces, and also which province is the healthiest or wealthiest. This can give them a general idea of which province is beneficial for them. It can also prompt ideas for further research. For example, if the data shows that the province of Drenthe has a much higher share of people with lung issues than the other provinces, that could lead to questions about industry and air quality. This logbook contains an analysis of the data, followed by decisions about which data and which graphs to use and how they will be shown on the data dashboard.
19-nov-2024
## here() starts at /Users/jarnoduiker/github_bioinf/Lifelines_datadashboard
20-nov-2024
(T = time stamp)
# Here I use describe() from the psych package to look at various statistics of my data.
describe(data_lifelines)
## vars n mean sd median trimmed mad
## GENDER 1 16696 1.59 0.49 2.00 1.61 0.00
## BIRTHYEAR 2 16696 1963.88 11.16 1964.00 1963.77 10.38
## AGE_T1 3 16696 46.62 11.18 47.00 46.67 10.38
## AGE_T2 4 16696 50.39 11.13 51.00 50.48 10.38
## AGE_T3 5 16696 56.40 11.16 57.00 56.51 10.38
## ZIP_CODE 6 16696 9088.02 703.59 9281.00 9150.68 637.52
## BMI_T1 7 16696 25.95 4.11 25.40 25.59 3.56
## WEIGHT_T1 8 16696 79.58 14.72 78.00 78.69 14.83
## HIP_T1 9 16696 99.18 9.18 98.00 98.63 7.41
## HEIGHT_T1 10 16696 174.92 9.35 174.00 174.72 10.38
## WAIST_T1 11 16696 90.15 11.80 90.00 89.70 11.86
## BMI_T2 12 16646 26.07 4.21 25.50 25.70 3.71
## WEIGHT_T2 13 16646 79.69 14.87 78.00 78.83 14.83
## HIP_T2 14 16646 99.76 9.15 99.00 99.17 7.41
## HEIGHT_T2 15 16646 174.66 9.43 174.00 174.46 10.38
## WAIST_T2 16 16646 90.10 12.15 89.00 89.64 11.86
## HEIGHT_T3 17 16696 174.22 9.51 173.50 174.03 10.38
## WEIGHT_T3 18 16696 81.85 15.66 80.30 80.89 15.27
## HIP_T3 19 16693 102.20 9.59 101.00 101.48 7.41
## WAIST_T3 20 16696 92.35 13.79 92.00 91.81 13.34
## EDUCATION_LOWER_T1 21 16666 0.25 0.44 0.00 0.19 0.00
## EDUCATION_LOWER_T2 22 15215 0.26 0.44 0.00 0.20 0.00
## FINANCE_T1 23 16696 6.96 2.62 7.00 7.23 2.97
## WORK_T1 24 15077 0.74 0.44 1.00 0.80 0.00
## WORK_T2 25 15218 0.77 0.42 1.00 0.84 0.00
## LOW_QUALITY_OF_LIFE_T1 26 16694 0.08 0.28 0.00 0.00 0.00
## LOW_QUALITY_OF_LIFE_T2 27 15230 0.10 0.30 0.00 0.00 0.00
## DBP_T1 28 16690 74.12 9.28 73.00 73.70 8.90
## DBP_T2 29 16631 74.13 9.39 73.00 73.75 8.90
## HBF_T1 30 16690 70.62 10.74 70.00 70.27 10.38
## HBF_T2 31 16631 68.54 11.11 68.00 68.09 10.38
## MAP_T1 32 16688 93.45 10.01 92.00 92.80 8.90
## MAP_T2 33 16631 94.53 10.36 93.00 93.85 8.90
## SBP_T1 34 16690 125.71 14.92 124.00 124.92 14.83
## SBP_T2 35 16631 128.18 15.91 127.00 127.31 16.31
## HTN_MED_T1 36 16630 0.11 0.31 0.00 0.01 0.00
## CHO_T1 37 16615 5.14 0.99 5.10 5.10 1.04
## GLU_T1 38 16557 5.00 0.78 4.90 4.92 0.44
## CHO_T2 39 16234 5.12 0.97 5.10 5.09 1.04
## GLU_T2 40 16121 5.05 0.83 4.90 4.95 0.44
## RESPIRATORY_DISEASE_T1 41 16655 0.08 0.27 0.00 0.00 0.00
## SMOKING 42 16438 0.17 0.37 0.00 0.09 0.00
## METABOLIC_DISORDER_T1 43 16694 0.00 0.43 0.00 0.00 0.00
## METABOLIC_DISORDER_T2 44 16694 0.02 0.46 0.00 0.00 0.00
## LLDS 45 15017 24.88 6.08 25.00 24.86 5.93
## SUMOFALCOHOL 46 11201 7.56 8.56 5.29 6.06 6.64
## SUMOFKCAL 47 11314 2010.83 628.25 1923.36 1961.05 533.65
## MWK_VAL 48 15574 508.20 665.92 270.00 360.13 289.11
## SCOR_VAL 49 15578 2654.12 3353.85 1500.00 1931.82 1630.86
## MWK_NO_VAL 50 15576 279.49 303.19 200.00 229.50 207.56
## SCOR_NO_VAL 51 15576 1502.57 1576.97 1080.00 1251.83 1156.43
## SPORTS_T1 52 15578 0.59 0.49 1.00 0.61 0.00
## CYCLE_COMMUTE_T1 53 14573 0.41 0.49 0.00 0.39 0.00
## VOLUNTEER_T1 54 14030 0.33 0.47 0.00 0.29 0.00
## PREGNANCIES 55 9167 1.92 1.19 2.00 1.90 1.48
## OSTEOARTHRITIS 56 16696 0.08 0.27 0.00 0.00 0.00
## BURNOUT_T1 57 16696 0.09 0.29 0.00 0.00 0.00
## DEPRESSION_T1 58 16696 0.10 0.29 0.00 0.00 0.00
## SLEEP_QUALITY 59 8118 0.35 0.48 0.00 0.32 0.00
## DIAG_CFS_CDC 60 14374 0.03 0.17 0.00 0.00 0.00
## DIAG_FIBROMYALGIA_ACR 61 14213 0.06 0.24 0.00 0.00 0.00
## DIAG_IBS_ROME3 62 14382 0.05 0.23 0.00 0.00 0.00
## C_SUM_T1 63 15484 29.81 3.36 30.00 29.89 2.97
## A_SUM_T1 64 15497 18.39 4.28 18.00 18.25 4.45
## SC_SUM_T1 65 15530 19.64 4.65 19.00 19.43 4.45
## I_SUM_T1 66 15463 22.03 3.92 22.00 21.96 4.45
## E_SUM_T1 67 15494 21.79 4.60 22.00 21.76 4.45
## SD_SUM_T1 68 15536 29.46 4.27 30.00 29.64 2.97
## V_SUM_T1 69 15559 18.13 4.09 18.00 17.94 4.45
## D_SUM_T1 70 15548 28.58 4.10 29.00 28.78 4.45
## LTE_SUM_T1 71 16344 1.00 1.22 1.00 0.79 1.48
## LDI_SUM_T1 72 16308 2.37 2.29 2.00 2.04 1.48
## LTE_SUM_T2 73 14947 0.79 1.09 0.00 0.60 0.00
## LDI_SUM_T2 74 15001 2.06 2.25 1.00 1.69 1.48
## NSES_YEAR 75 16696 2009.36 1.46 2010.00 2009.70 0.00
## NSES 76 16200 -0.58 1.08 -0.58 -0.55 1.04
## NEIGHBOURHOOD1_T2 77 11747 8.22 1.46 8.00 8.38 1.48
## NEIGHBOURHOOD2_T2 78 11806 1.94 0.82 2.00 1.87 1.48
## NEIGHBOURHOOD3_T2 79 11812 1.45 0.68 1.00 1.34 0.00
## NEIGHBOURHOOD4_T2 80 11810 1.76 1.04 1.00 1.56 0.00
## NEIGHBOURHOOD5_T2 81 11809 3.69 1.02 4.00 3.78 1.48
## NEIGHBOURHOOD6_T2 82 11812 4.08 0.81 4.00 4.17 0.00
## MENTAL_DISORDER_T1 83 16320 0.08 0.32 0.00 0.00 0.00
## MENTAL_DISORDER_T2 84 13472 0.09 0.33 0.00 0.00 0.00
## min max range skew kurtosis se
## GENDER 1.00 2.00 1.00 -0.35 -1.88 0.00
## BIRTHYEAR 1927.00 1995.00 68.00 0.08 -0.21 0.09
## AGE_T1 18.00 84.00 66.00 -0.02 -0.25 0.09
## AGE_T2 20.00 88.00 68.00 -0.05 -0.21 0.09
## AGE_T3 25.00 95.00 70.00 -0.07 -0.18 0.09
## ZIP_CODE 1015.00 9998.00 8983.00 -1.41 7.47 5.45
## BMI_T1 15.40 53.80 38.40 1.18 2.95 0.03
## WEIGHT_T1 42.00 158.00 116.00 0.68 0.86 0.11
## HIP_T1 62.00 185.00 123.00 0.93 3.03 0.07
## HEIGHT_T1 137.00 207.00 70.00 0.18 -0.40 0.07
## WAIST_T1 60.00 156.00 96.00 0.47 0.60 0.09
## BMI_T2 13.00 54.50 41.50 1.18 2.91 0.03
## WEIGHT_T2 43.50 160.00 116.50 0.65 0.70 0.12
## HIP_T2 68.00 192.50 124.50 0.99 3.19 0.07
## HEIGHT_T2 116.50 206.00 89.50 0.16 -0.27 0.07
## WAIST_T2 52.00 155.00 103.00 0.44 0.41 0.09
## HEIGHT_T3 108.00 208.50 100.50 0.12 0.03 0.07
## WEIGHT_T3 36.70 168.30 131.60 0.72 1.07 0.12
## HIP_T3 0.00 175.00 175.00 0.74 6.01 0.07
## WAIST_T3 11.00 761.00 750.00 7.17 330.41 0.11
## EDUCATION_LOWER_T1 0.00 1.00 1.00 1.13 -0.73 0.00
## EDUCATION_LOWER_T2 0.00 1.00 1.00 1.12 -0.75 0.00
## FINANCE_T1 1.00 10.00 9.00 -0.75 -0.40 0.02
## WORK_T1 0.00 1.00 1.00 -1.11 -0.77 0.00
## WORK_T2 0.00 1.00 1.00 -1.27 -0.38 0.00
## LOW_QUALITY_OF_LIFE_T1 0.00 1.00 1.00 3.03 7.19 0.00
## LOW_QUALITY_OF_LIFE_T2 0.00 1.00 1.00 2.63 4.90 0.00
## DBP_T1 47.00 143.00 96.00 0.54 0.70 0.07
## DBP_T2 43.00 128.00 85.00 0.43 0.26 0.07
## HBF_T1 31.00 147.00 116.00 0.41 0.73 0.08
## HBF_T2 34.00 142.00 108.00 0.47 0.64 0.09
## MAP_T1 0.00 160.00 160.00 0.68 2.16 0.08
## MAP_T2 65.00 151.00 86.00 0.67 0.61 0.08
## SBP_T1 72.00 221.00 149.00 0.64 1.03 0.12
## SBP_T2 86.00 213.00 127.00 0.57 0.40 0.12
## HTN_MED_T1 0.00 1.00 1.00 2.49 4.19 0.00
## CHO_T1 2.10 9.50 7.40 0.36 0.18 0.01
## GLU_T1 2.70 22.10 19.40 6.05 85.10 0.01
## CHO_T2 2.10 9.80 7.70 0.36 0.31 0.01
## GLU_T2 2.80 20.60 17.80 4.87 48.42 0.01
## RESPIRATORY_DISEASE_T1 0.00 1.00 1.00 3.08 7.50 0.00
## SMOKING 0.00 1.00 1.00 1.77 1.12 0.00
## METABOLIC_DISORDER_T1 -9.00 1.00 10.00 -17.24 349.06 0.00
## METABOLIC_DISORDER_T2 -9.00 1.00 10.00 -15.03 288.83 0.00
## LLDS 4.00 46.00 42.00 0.02 -0.20 0.05
## SUMOFALCOHOL 0.00 76.49 76.49 1.89 5.11 0.08
## SUMOFKCAL 4.18 7460.75 7456.57 1.28 4.90 5.91
## MWK_VAL 0.00 6227.00 6227.00 2.48 7.25 5.34
## SCOR_VAL 0.00 33960.00 33960.00 2.41 7.07 26.87
## MWK_NO_VAL 0.00 4450.00 4450.00 2.96 16.92 2.43
## SCOR_NO_VAL 0.00 26570.00 26570.00 2.70 16.57 12.64
## SPORTS_T1 0.00 1.00 1.00 -0.37 -1.86 0.00
## CYCLE_COMMUTE_T1 0.00 1.00 1.00 0.36 -1.87 0.00
## VOLUNTEER_T1 0.00 1.00 1.00 0.71 -1.49 0.00
## PREGNANCIES 0.00 9.00 9.00 0.19 0.62 0.01
## OSTEOARTHRITIS 0.00 1.00 1.00 3.05 7.31 0.00
## BURNOUT_T1 0.00 1.00 1.00 2.84 6.06 0.00
## DEPRESSION_T1 0.00 1.00 1.00 2.76 5.62 0.00
## SLEEP_QUALITY 0.00 1.00 1.00 0.61 -1.62 0.01
## DIAG_CFS_CDC 0.00 1.00 1.00 5.53 28.61 0.00
## DIAG_FIBROMYALGIA_ACR 0.00 1.00 1.00 3.70 11.70 0.00
## DIAG_IBS_ROME3 0.00 1.00 1.00 3.92 13.40 0.00
## C_SUM_T1 12.00 40.00 28.00 -0.31 0.84 0.03
## A_SUM_T1 8.00 38.00 30.00 0.35 0.21 0.03
## SC_SUM_T1 8.00 40.00 32.00 0.43 0.14 0.04
## I_SUM_T1 8.00 38.00 30.00 0.17 0.06 0.03
## E_SUM_T1 8.00 39.00 31.00 0.06 -0.22 0.04
## SD_SUM_T1 11.00 40.00 29.00 -0.51 0.70 0.03
## V_SUM_T1 8.00 38.00 30.00 0.55 0.86 0.03
## D_SUM_T1 12.00 40.00 28.00 -0.49 0.45 0.03
## LTE_SUM_T1 0.00 11.00 11.00 1.66 4.24 0.01
## LDI_SUM_T1 0.00 17.00 17.00 1.40 2.51 0.02
## LTE_SUM_T2 0.00 12.00 12.00 2.29 10.93 0.01
## LDI_SUM_T2 0.00 23.00 23.00 2.07 8.60 0.02
## NSES_YEAR 2006.00 2010.00 4.00 -1.86 1.47 0.01
## NSES -7.12 2.93 10.05 -0.30 0.76 0.01
## NEIGHBOURHOOD1_T2 1.00 10.00 9.00 -2.02 7.04 0.01
## NEIGHBOURHOOD2_T2 1.00 5.00 4.00 0.74 0.56 0.01
## NEIGHBOURHOOD3_T2 1.00 5.00 4.00 1.87 5.01 0.01
## NEIGHBOURHOOD4_T2 1.00 5.00 4.00 1.35 1.05 0.01
## NEIGHBOURHOOD5_T2 1.00 5.00 4.00 -0.67 0.12 0.01
## NEIGHBOURHOOD6_T2 1.00 5.00 4.00 -1.23 2.58 0.01
## MENTAL_DISORDER_T1 0.00 4.00 4.00 4.92 30.49 0.00
## MENTAL_DISORDER_T2 0.00 5.00 5.00 4.73 27.68 0.00
Here I use describe(). Its output contains the following columns.
vars is the variable index.
n is the number of non-missing values.
mean is the average.
sd is the standard deviation.
median is the middle value.
trimmed is the mean after trimming 10% of the observations from each tail.
mad is the median absolute deviation (the median of the absolute deviations from the median).
min and max are the minimum and maximum values.
range is the difference between the maximum and the minimum.
skew is the skewness of the distribution (between -1 and +1 is excellent, between -2 and +2 is acceptable). Hair, J.F., Hult, G.T.M., Ringle, C.M., & Sarstedt, M. (2022). A Primer on Partial Least Squares Structural Equation Modeling (PLS-SEM) (3 ed.). Thousand Oaks, CA: Sage.
kurtosis is a measure of the 'tailedness' of the distribution.
se is the standard error of the mean.
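To make a few of these columns concrete, the sketch below recomputes them with base R on a small made-up vector. The values are hypothetical and not taken from the Lifelines data; the scaling constant 1.4826 is the default of `mad()` in base R.

```r
# Hypothetical toy values, not taken from the Lifelines data
x <- c(2, 4, 4, 5, 7, 9, 10, 12, 15, 40)

n       <- length(x)
trimmed <- mean(x, trim = 0.1)   # drops the lowest and highest 10% before averaging
mad_x   <- mad(x)                # median absolute deviation, scaled by 1.4826
rng     <- max(x) - min(x)       # range
se      <- sd(x) / sqrt(n)       # standard error of the mean

c(n = n, mean = mean(x), trimmed = trimmed, mad = mad_x, range = rng, se = se)
```

The single large value (40) pulls the plain mean up, while the trimmed mean and the mad are much less affected. That is why it helps to read these columns next to each other.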
DATA ANALYSIS The headers need to be changed, because they are unreadable without a codebook. There are many NAs between measurements. Apart from that, the csv loads well and there are no problems with it.
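The describe() output above reports a minimum of -9 for METABOLIC_DISORDER_T1 and METABOLIC_DISORDER_T2, while the other yes/no columns run from 0 to 1. A minimal sketch of how such a value could be recoded follows. It assumes that -9 is a missing-value code, which should be checked in the codebook first; the toy vector is hypothetical.

```r
# Hypothetical toy column; -9 is ASSUMED to be a missing-value code
metabolic <- c(0, 1, -9, 0, 0, 1, -9)

# Replace the assumed code with NA so it no longer counts as a measurement
metabolic_clean <- replace(metabolic, metabolic == -9, NA)

mean(metabolic)                       # distorted by the -9 values
mean(metabolic_clean, na.rm = TRUE)   # share of 1s among the valid answers
```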
Here I have looked up all postal codes for each province. This is necessary to make a new category in the dataframe that shows which province each participant is from. The zip codes are stored in vectors, and the dataframe is then mutated (this is where the new column is made).
Friesland postal codes: 8388 - 9299 + 9850 - 9859
friesland_zipcodes <- c(
8401:8409, 8411:8417, 8421:8429, 8431:8439, 8441:8449,
8451:8459, 8461:8469, 8471:8479, 8481:8489, 8491:8499,
8501:8509, 8511:8519, 8521:8529, 8531:8539, 8541:8549,
8551:8559, 8561:8569, 8571:8579, 8581:8589, 8591:8599,
8601:8609, 8611:8619, 8621:8629, 8631:8639, 8641:8649,
8651:8659, 8661:8669, 8671:8679, 8681:8689, 8691:8699,
8701:8709, 8711:8719, 8721:8729, 8731:8739, 8741:8749,
8751:8759, 8761:8769, 8771:8779, 8781:8789, 8791:8799,
8801:8809, 8811:8819, 8821:8829, 8831:8839, 8841:8849,
8851:8859, 8861:8869, 8871:8879, 8881:8889, 8891:8899,
9001:9009, 9011:9019, 9021:9029, 9031:9039, 9041:9049,
9051:9059, 9061:9069, 9071:9079, 9081:9089, 9091:9099,
9101:9109, 9111:9119, 9121:9129, 9131:9139, 9141:9149,
9151:9159, 9161:9169, 9171:9179, 9181:9189, 9191:9199,
9201:9209, 9211:9219, 9221:9229, 9231:9239, 9241:9249,
9251:9259, 9261:9269, 9271:9279, 9281:9289, 9291:9299
)
groningen_zipcodes <- c(
2750:2752, 2760:2761, 2811, 2840:2841, 2910:2914,
5340:5359, 5366:5368, 5370:5371, 5373, 5386, 5394:5398,
9350:9351, 9354:9356, 9359, 9361:9367, 9479,
9500:9503, 9540:9541, 9545, 9550:9551, 9560:9561, 9563, 9566,
9580:9581, 9584:9585, 9591, 9600:9611, 9613:9629,
9631:9633, 9635:9636, 9640:9649, 9651, 9661, 9663, 9665,
9670:9675, 9677:9679, 9681:9688, 9691, 9693, 9695:9704,
9711:9718, 9721:9728, 9731:9738, 9741:9747, 9750:9756,
9771, 9773:9774, 9790:9798, 9800:9805, 9811:9812,
9821:9825, 9827:9828, 9831:9833, 9841:9845,
9860:9866, 9881:9886, 9891:9893, 9900:9915,
9917:9925, 9930:9934, 9936:9937, 9939, 9942:9949,
9951, 9953:9957, 9961:9969, 9970:9999
)
drenthe_zipcodes <- c(
3925, 7705, 7740:7742, 7750:7751, 7753:7756, 7760:7761, 7764:7766,
7800:7801, 7811:7815, 7821:7828, 7830:7831, 7833, 7840:7849, 7851:7856,
7858:7859, 7860:7864, 7871:7877, 7880:7881, 7884:7885, 7887, 7889:7892,
7894:7895, 7900:7918, 7920:7929, 7931:7938, 7940:7944, 7946, 7948:7949,
7957:7958, 7960:7966, 7970:7975, 7980:7986, 7990:7991, 8066, 8325:8326,
8334:8339, 8341:8347, 8351, 8355:8356, 8361:8363, 8371:8378, 8380:8398,
8420:8428, 8430:8435, 8437:8439, 8470:8479, 8481:8489, 9300:9307,
9311:9315, 9320:9321, 9330:9337, 9341:9343, 9351, 9400:9423, 9430:9439,
9441:9449, 9450:9469, 9470:9475, 9480:9489, 9491:9497, 9511:9512, 9514:9515,
9520:9528, 9530:9537, 9564, 9571, 9573:9574, 9654:9659, 9749, 9760:9761,
9765:9766, 9780:9785, 9959
)
# Here the dataframe is mutated by adding a new column with the province each participant is from. case_when() checks the ZIP_CODE column against each vector in order: a code in friesland_zipcodes gives "Friesland" in the Province column, otherwise the next option is tried. A code that is in none of the vectors becomes NA.
data_lifelines <- data_lifelines %>%
mutate(Province = case_when(ZIP_CODE %in% friesland_zipcodes ~ "Friesland",
ZIP_CODE %in% groningen_zipcodes ~ "Groningen",
ZIP_CODE %in% drenthe_zipcodes ~ "Drenthe"))
# Show the head of only the Province column of data_lifelines to see if it worked
head(data_lifelines$Province)
## [1] "Groningen" "Friesland" "Friesland" "Drenthe" "Friesland" NA
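Because case_when() uses the first matching condition, a zip code that appears in two vectors is always assigned to the province that is listed first. In the vectors above, for example, 8421:8428 occurs in both the Friesland and the Drenthe vector. The sketch below shows one way to check for such overlaps and for codes that match no province. It uses short hypothetical vectors in place of the full zip code vectors.

```r
# Hypothetical short vectors standing in for the full zip code vectors
zip_a <- c(8421:8429, 9001:9009)
zip_b <- c(8420:8428, 9300:9307)

# Codes claimed by both vectors
overlap <- intersect(zip_a, zip_b)
overlap

# Codes in the data that match neither vector end up as NA in Province
zips_in_data <- c(8425, 9005, 9301, 1015)
unmatched <- setdiff(zips_in_data, union(zip_a, zip_b))
unmatched
```

Running the same check on the real vectors would show which codes need to be assigned to a single province, and which participants live outside the three provinces.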
# Here I take gender from the dataframe; with %>% the values 1 and 2 are converted to Male and Female
GENDER1 <- data_lifelines$GENDER %>% factor(levels = c(1,2), labels = c("Male", "Female"))
# Here I use ggplot to look at the alcohol consumption per gender in grams
ggplot(data_lifelines, aes(x=GENDER1, y=SUMOFALCOHOL, fill=GENDER1)) +
  geom_boxplot() +
  ylab("Sum of alcohol per week in grams") +
  xlab("Gender") +
  labs(title="How many grams of alcohol does each gender average per week")
## Warning: Removed 5495 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
In this boxplot you can see that men drink more grams of alcohol per
week than women do on average. However, the difference is not large
enough to be worth showing in the data dashboard.
ggplot(data_lifelines, aes(x=GENDER1, y=SUMOFKCAL, fill=GENDER1)) +
geom_boxplot() +
xlab("Gender") +
ylab("Sum of kcal - per day") +
labs(title = "Sum of kcal - per day, per gender")
## Warning: Removed 5382 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
Here is a boxplot of the sum of kcal per day for each gender. As
expected, the average for men is around 2300-2400 and the average for
women is around 1800-1900. Something notable is that no men have an
extremely low kcal intake, while some women eat 0 or close to 0 kcal
each day. This could, however, be a randomisation error.
This is because to keep the data private the lifelines project
ggplot(data_lifelines, aes(x=WEIGHT_T1, y=HEIGHT_T1)) +
geom_point(alpha= 0.2, size=.2) +
xlab("Weight on measuring moment 1 (in KG)") +
ylab("Height in cm") +
labs(title = "Height and weight scatterplot")
In this scatterplot height and weight are shown. Due to the large
amount of data the points had to be made quite small and the alpha had
to be set very low. The plot does show a general area where most of the
dots are: between 60-100 kg and 150-190 cm. This plot is not very
interesting and therefore will not be shown in the data dashboard.
ggplot(data_lifelines, aes(x= PREGNANCIES)) +
geom_bar(fill="#69b3a2", alpha=0.8) +
xlab("Pregnancies") +
labs(title = "How many pregnancies do women have")
## Warning: Removed 7529 rows containing non-finite outside the scale range
## (`stat_count()`).
In the Lifelines questionnaire women were asked how many pregnancies
they have had; this is displayed here. The most common answer is 2
pregnancies, then 3, then 0. This seems quite ordinary and not
necessarily notable for the dashboard.
ggplot(data_lifelines, aes(x=AGE_T1, y=WEIGHT_T1)) +
geom_jitter(alpha= 0.2, size=.2)+
ylab("Weight on measuring moment 1 (in KG)") +
xlab("age") +
labs(title = "Age and weight scatterplot")
In this scatterplot weight and age are compared, and there is no
notable spike or difference between the ages. The only thing I see is a
lot of data points around the age of 50, but this is because most
participants are around this age.
27 nov: making a subset for smokers vs non-smokers
smokers_lifeline <- subset(data_lifelines, data_lifelines$SMOKING > 0)
non_smokers_lifeline <- subset(data_lifelines, data_lifelines$SMOKING == 0)
gender_smokers <- smokers_lifeline$GENDER %>% factor(levels = c(1,2), labels = c("Male", "Female"))
head(gender_smokers)
## [1] Female Female Male Male Female Male
## Levels: Male Female
ggplot(smokers_lifeline, mapping = aes(x=gender_smokers, fill = factor(RESPIRATORY_DISEASE_T1))) +
  geom_bar() +
  xlab("Gender") +
  ylab("Count of people") +
  labs(title = "Respiratory disease among smokers for each gender", fill = "Respiratory disease
0 = No
1 = Yes")
This bar plot shows the smokers per gender. Because the data is already
limited to smokers, the fill colour shows RESPIRATORY_DISEASE_T1 and
not smoking status. Interestingly, the group with value 1 is a lot
smaller than expected: fewer than 250 people. It could be interesting
to compare the number of smokers with the number of lung problems.
data_lifelines_gender <- data_lifelines$GENDER %>% factor(levels = c(1,2), labels = c("Male", "Female"))
ggplot(data_lifelines, mapping = aes(x=AGE_T1, fill = data_lifelines_gender)) +
  geom_bar() +
  xlab("Age") +
  ylab("Count of people") +
  labs(title = "Genders per age", fill = "Gender") +
  coord_flip()
This plot shows the distribution of males and females per age (the x
aesthetic is AGE_T1, drawn on the vertical axis because of
coord_flip()). This graph is nice and could be a good fit for the data
dashboard. It can display the difference between males and females well
when the data dashboard is filtered, for example by province.
data_lifelines_gender <- data_lifelines$GENDER %>% factor(levels = c(1,2), labels = c("Male", "Female"))
# geom_quasirandom() draws BMI per province, so the labels describe province and BMI
ggplot(data_lifelines, mapping = aes(x=Province, y=BMI_T1, color = data_lifelines_gender)) +
  geom_quasirandom() +
  xlab("Province") +
  ylab("BMI on measuring moment 1") +
  labs(title = "BMI per province and gender", color = "Gender")
ggplot(data_lifelines, mapping = aes(x=AGE_T1, fill = factor(DEPRESSION_T1))) +
geom_bar()+
xlab("Age") +
ylab("Count of people") +
labs(title = "This graph shows if people have depression shown by age", fill = "Depression
0 = Not depressed
1 = depressed")
This graph shows depression per age. It is hard to say something about
depression from it, because there are far more answers in the 30-50
range, so comparing the ages is hard. Depression definitely has a place
on the data dashboard, but not in this graph.
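One way around the unequal group sizes is to compare the share of people with depression per age group instead of the raw counts. The sketch below illustrates the idea on hypothetical toy data; the age breaks are an arbitrary choice for the example.

```r
# Hypothetical toy data, not the Lifelines values
age        <- c(25, 28, 34, 36, 41, 44, 47, 52, 58, 63)
depression <- c( 0,  1,  0,  0,  1,  0,  0,  1,  0,  0)

# Group the ages into bins; the breaks are an arbitrary choice for this sketch
age_group <- cut(age, breaks = c(18, 30, 50, 85), right = FALSE)

counts <- table(age_group)                      # group sizes differ
share  <- tapply(depression, age_group, mean)   # proportion with depression per group

counts
share
```

In ggplot2 a similar effect can be obtained with `geom_bar(position = "fill")`, which scales every bar to the same height so the proportions can be compared directly.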
ggplot(data_lifelines, mapping = aes(x=AGE_T1, fill = factor(BURNOUT_T1))) +
geom_bar()+
xlab("Age") +
ylab("Count of people") +
labs(title = "This graph shows if people have a burnout shown by age", fill = "Burnout
0 = Not Burned out
1 = In a burnout")
Like the previous graph, this one shows burnout by age. Burnouts seem
to start around the age of 30 and are quite frequent between 30 and 50,
which may mean that the stress people experience before the age of 30
is usually not enough to cause a burnout. This could definitely be a
plot to show on the data dashboard.
ggplot(data_lifelines, mapping = aes(x=AGE_T1, fill = factor(OSTEOARTHRITIS))) +
geom_bar() +
xlab("Age") +
ylab("Count of people") +
labs(title = "This graph shows if people have osteoarthritis shown by age", fill = "Osteoarthritis
0 = doesn't have osteoarthritis
1 = does have osteoarthritis")
Osteoarthritis is a degenerative joint disease, in which the tissues in
the joint break down over time. It is the most common type of arthritis
and is more common in older people. People with osteoarthritis usually
have joint pain and, after rest or inactivity, stiffness for a short
period of time. In this plot the peak around the age of 50 can be
ignored, because there are almost three times as many answers for that
age as for the next ages. The plot does show that this joint disease is
indeed more prominent in older people. This confirms what is already
known, so it is not a very interesting plot, but we could show a plot
per province to see what the difference is between them.
ggplot(data_lifelines, mapping = aes(x=AGE_T1, fill = factor(FINANCE_T1))) +
geom_bar() +
xlab("Age") +
ylab("Count of people") +
labs(title = "This graph shows the financial situation per age", fill = "Finance situation
1 = worst
10 = best (3500$+ a month)")
This plot shows the financial situation per age. It is also difficult
to read, because of the big difference in counts between age 50 and
50+. Finance will be good to show in relation to other lifestyle
factors, but this plot in particular is not interesting enough.
ggplot(data_lifelines, aes(SUMOFALCOHOL, FINANCE_T1)) +
geom_jitter(width = .5, size=1) +
xlab("Gram of alc per week") +
ylab("Financial situation") +
labs(title = "This plot shows the financial situation and alcohol in gram per week")
## Warning: Removed 5495 rows containing missing values or values outside the scale range
## (`geom_point()`).
This plot shows how many grams of alcohol per week the subjects drink
and what their financial situation is. In my opinion it is not a really
noteworthy plot, so it will not be used in the final data dashboard.
# labels corrected to match the plotted variables (SUMOFALCOHOL and PREGNANCIES)
ggplot(data_lifelines, aes(SUMOFALCOHOL, PREGNANCIES)) +
  geom_jitter(width = .5, size=1) +
  xlab("Gram of alc per week") +
  ylab("Number of pregnancies") +
  labs(title = "This plot shows the number of pregnancies and alcohol in gram per week")
## Warning: Removed 10390 rows containing missing values or values outside the scale range
## (`geom_point()`).
ggplot(data_lifelines, mapping = aes(x=GENDER1, fill = factor(DEPRESSION_T1))) +
geom_bar()
This plot shows depression per gender. It is not that interesting and
will not be added to the final app.
bin<-hexbin(data_lifelines$WEIGHT_T1, data_lifelines$HEIGHT_T1, xbins=40)
my_colors=colorRampPalette(rev(brewer.pal(11,'Spectral')))
plot(bin,
main = "This Hexbin shows height and weight",
xlab = "Weight in KG",
ylab = "Height in cm",
colramp = my_colors,
legend = FALSE)
A hexbin plot is useful to represent the relationship between 2
numerical variables when you have a lot of data points. Instead of
drawing overlapping points, the plotting window is split into hexagonal
bins and each bin is coloured by the number of points that fall in it.
# SUMOFALCOHOL (grams of alcohol) is used for the y-axis so the data matches the axis label
bin_alc<-hexbin(data_lifelines$AGE_T1, data_lifelines$SUMOFALCOHOL, xbins=20)
my_colors=colorRampPalette(rev(brewer.pal(11,'Spectral')))
plot(bin_alc,
     main = "This Hexbin shows age and alcohol",
     xlab = "AGE",
     ylab = "Alcohol in gram per week",
     colramp = my_colors,
     legend = FALSE)
In my opinion this hexbin is not a really interesting plot: hexbins are
hard to understand and it does not show anything notable. Therefore it
will not be in the final app.
bin_blp_t1<-hexbin(data_lifelines$WEIGHT_T1, data_lifelines$DBP_T1, xbins=40)
my_colors=colorRampPalette(rev(brewer.pal(11,'Spectral')))
plot(bin_blp_t1,
xlab = "Weight (kg)",
ylab = "DBP (mm hg)",
colramp = my_colors)
bin_blp_t2<-hexbin(data_lifelines$WEIGHT_T2, data_lifelines$DBP_T2, xbins=40)
my_colors=colorRampPalette(rev(brewer.pal(11,'Spectral')))
plot(bin_blp_t2,
xlab = "Weight (kg)",
ylab = "DBP (mm hg)",
colramp = my_colors)
Here two hexbin plots have been made. They show DBP and weight at 2
different measuring moments, to see if there is a difference between
them. At the second measuring moment the weight seems to have
increased, but the outliers in the maximum DBP have gone down. At the
first measuring moment there are two counts around 140 DBP, which is
incredibly high and dangerous.
# Keep only rows where both plotted columns are present. na.omit() on the whole
# dataframe would drop every row that has an NA in any column.
data_sleepqual_dbp <- data_lifelines %>% filter(!is.na(SLEEP_QUALITY), !is.na(DBP_T1))
ggplot(data_sleepqual_dbp, aes(x=factor(SLEEP_QUALITY), y=DBP_T1)) +
  geom_violin() +
  xlab("Sleep quality") +
  ylab("DBP (mm hg)") +
  theme_minimal()
This is a violin plot. A violin plot depicts the distribution of
numeric data for one or more groups using density curves. The width of
each curve corresponds to the approximate frequency of data points in
each region. This plot shows DBP for people who have a good sleep
quality and people who have a bad sleep quality. It shows that people
with a better sleep quality (value 1) generally have a lower DBP. This
plot will be put in the app, because it could be interesting in
combination with the interactive filters.
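When only two columns are plotted, it matters whether rows are dropped for a missing value in any column or only for a missing value in those two columns. The describe() output shows that columns such as PREGNANCIES and SLEEP_QUALITY have far fewer values than the full 16696 rows, so the difference can be large. The sketch below uses a hypothetical toy dataframe to compare the two approaches.

```r
# Hypothetical toy data; PREGNANCIES is assumed to be NA for part of the rows
toy <- data.frame(
  SLEEP_QUALITY = c(0, 1, 1, NA, 0),
  DBP_T1        = c(80, 72, NA, 75, 78),
  PREGNANCIES   = c(NA, 2, 1, NA, NA)
)

# Keeps only rows that are complete in EVERY column
all_columns <- na.omit(toy)

# Keeps rows that are complete in the two plotted columns
two_columns <- toy[complete.cases(toy[, c("SLEEP_QUALITY", "DBP_T1")]), ]

nrow(all_columns)
nrow(two_columns)
```

Filtering on the plotted columns only keeps as much data as possible for each plot.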
ggplot(data_lifelines, aes(x=factor(FINANCE_T1), y=DBP_T1)) +
geom_boxplot() +
xlab("Finance") +
ylab("DBP (mm hg)") +
theme_minimal()
## Warning: Removed 6 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
Here is a boxplot of DBP for each financial situation. The line in each
box shows the median. The median is lowest for people who have about
750 to spend and highest in financial situation 7. This suggests that
DBP rises with more money, but that cannot be said for certain, of
course, because there are more factors to look at.
# factor(SPORTS_T1) groups by the answer itself; !is.na(SPORTS_T1) only
# separated answered from unanswered rows
ggplot(data_lifelines, aes(x=factor(SPORTS_T1), y=DBP_T1)) +
  geom_boxplot() +
  xlab("Participates in sports") +
  ylab("DBP (mm hg)") +
  theme_minimal()
## Warning: Removed 6 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
Here is a boxplot that shows the difference in DBP, measured in mm hg,
between people who do sports and people who don't do sports. There
isn't a big difference, but maybe with the filtering there could be an
interesting result.
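For a plot like this it matters how the sports variable is mapped to the x-axis. An expression such as `!is.na(SPORTS_T1)` is TRUE for every answered row, whatever the answer was, while `factor(SPORTS_T1)` groups by the answer itself. The sketch below shows the difference on hypothetical toy answers, assuming 1 means "does sports" and 0 means "does not".

```r
# Hypothetical toy answers: 1 = does sports, 0 = does not, NA = not answered
sports <- c(1, 0, 1, NA, 0, 1)

# TRUE for every answered row, regardless of the answer
answered <- !is.na(sports)

# Groups by the actual answer
does_sport <- factor(sports, levels = c(0, 1), labels = c("No", "Yes"))

table(answered)
table(does_sport, useNA = "ifany")
```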
Added some interactive plotly versions of the existing static plots with the following code. Plotly makes a ggplot interactive; this is done by rendering the plot with ggplotly(plot).
p_age_count <- ggplot(data_lifelines, aes(x = AGE_T1, fill = factor(GENDER))) +
  geom_bar() +
  ylab("Count of People") +
  xlab("Age") +
  labs(fill = "Gender (1 = Male, 2 = Female)") +
  theme_minimal() +
  coord_flip()
ggplotly(p_age_count)
Here is an example of what a plot looks like when rendered with ggplotly. The user can now see the exact count and which gender each bar represents. This will be implemented in the app, so that the user has both a static and an interactive plot to work with.
netherlands <- rnaturalearth::ne_states(country = "Netherlands", returnclass = "sf")
selected_provinces <- netherlands %>%
filter(name %in% c("Groningen", "Friesland", "Drenthe"))
tm_shape(selected_provinces) +
tm_polygons(col = "name", title = "Province", border.col = "black") +
tm_layout(title = "Selected Dutch Provinces") +
tm_borders()
##
## ── tmap v3 code detected ───────────────────────────────────────────────────────
## [v3->v4] `tm_polygons()`: use 'fill' for the fill color of polygons/symbols
## (instead of 'col'), and 'col' for the outlines (instead of 'border.col').
## [v3->v4] `tm_polygons()`: migrate the argument(s) related to the legend of the
## visual variable `fill` namely 'title' to 'fill.legend = tm_legend(<HERE>)'
## [v3->v4] `tm_layout()`: use `tm_title()` instead of `tm_layout(title = )`
## This message is displayed once every 8 hours.
I thought a fun way to show where the data came from would be to show
the provinces with this plot. It shows Friesland, Groningen and Drenthe.
This map will be included in the FAQ.
# factor(SPORTS_T1) groups by the answer itself; !is.na(SPORTS_T1) only
# separated answered from unanswered rows
ggplot(data_lifelines, aes(x=factor(SPORTS_T1), y=CHO_T1)) +
  geom_boxplot() +
  xlab("Participates in sports") +
  ylab("Cholesterol (mmol/L)") +
  theme_minimal()
## Warning: Removed 81 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
Here is a boxplot that shows the difference in cholesterol levels
between people who do and don't do sports. It is generally known that
playing sports could improve your general health. This might be
interesting to show in the app.
# factor(SPORTS_T1) groups by the answer itself; !is.na(SPORTS_T1) only
# separated answered from unanswered rows
ggplot(data_lifelines, aes(x=factor(SPORTS_T1), y=DBP_T1)) +
  geom_boxplot() +
  xlab("Participates in sports") +
  ylab("DBP (mm hg)") +
  theme_minimal()
## Warning: Removed 6 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
Here is a boxplot that shows the difference in DBP levels between
people who do and don't do sports. It is known that playing sports
could lower your blood pressure in general. This might be interesting
to show in the app. The filtering could potentially show some
interesting results.
ggplot(data_lifelines, aes(x = DEPRESSION_T1, y = factor(FINANCE_T1), fill = FINANCE_T1)) +
geom_density_ridges() +
theme_ridges() +
theme(legend.position = "none")
## Picking joint bandwidth of 0.0677
This ridge plot shows the density of DEPRESSION_T1 across the different
financial levels (FINANCE_T1, grouped as 1-10). Each curve represents a
financial group. Because DEPRESSION_T1 is a 0/1 indicator, the size of
the bump near 1 shows how common depression is within that group. A
systematic shift in these bumps suggests a potential relationship
between financial status and depression, where higher financial levels
might be associated with less depression. The gradient colour helps to
distinguish the groups visually. It seems that financial groups 4-5
have more people with depression than the higher financial situations.
This could have multiple reasons, from stress about not being able to
afford things to high work stress to make enough money to get by.
ggplot(data=data_lifelines, aes(x=NSES, group=factor(FINANCE_T1), fill=FINANCE_T1)) +
geom_density(adjust=1.5) +
facet_wrap(~FINANCE_T1)
## Warning: Removed 496 rows containing non-finite outside the scale range
## (`stat_density()`).
NSES = Neighborhood socio-economic status score according to CBS
Statistics Netherlands, based on inhabitants' educational level, income
and job prospects. This faceted density plot shows the distribution of
NSES scores across the different financial levels (FINANCE_T1, grouped
as 1-10). Each panel represents a financial group, with the shape and
position of the curve indicating how NSES scores are distributed within
that group. A higher score means a better neighborhood, which could be
good for people's mental and physical health. This plot will be
interesting to show in the app.
ggplot(data_lifelines, aes(x = BURNOUT_T1, y = factor(FINANCE_T1), fill = FINANCE_T1)) +
geom_density_ridges() +
theme_ridges() +
theme(legend.position = "none")
## Picking joint bandwidth of 0.0606
This ridge plot shows the density of BURNOUT_T1 across the different
financial levels (FINANCE_T1, grouped as 1-10). Each curve represents a
financial group, with the shape and position indicating how burnout is
distributed within that group. Compared to the depression plot there
aren't any clear peaks. For this reason it will not be interesting to
show this plot.
ggplot(data=data_lifelines, aes(x=SUMOFALCOHOL, group=factor(DEPRESSION_T1), fill=DEPRESSION_T1)) +
geom_density(adjust=1.5) +
facet_wrap(~DEPRESSION_T1) +
theme(legend.position = "none")
## Warning: Removed 5495 rows containing non-finite outside the scale range
## (`stat_density()`).
This faceted density plot shows the distribution of how much alcohol
someone drinks per week in grams, for people who do and don't have
depression. With the basic filter there doesn't seem to be a clear
difference in alcohol consumption between people who do and don't have
depression. Therefore putting it in the app might not be very
important, but it could serve as a backup.
The plan for the data dashboard is to let the user filter the data. The data can be filtered in 3 ways: by gender, by which provinces are shown and by which age range is shown. Then the user can select a plot. These plots will be preselected by me and will show a couple of lifestyle factors I have selected.
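The three filters can be sketched as one base R function. The column names Province, GENDER and AGE_T1 follow this logbook; the function name `filter_lifelines` and the toy dataframe are hypothetical and only illustrate the idea. In the dashboard the arguments would come from the sidebar inputs.

```r
# Sketch of the sidebar filters as one function (hypothetical helper name)
filter_lifelines <- function(data, provinces, genders, age_range) {
  keep <- data$Province %in% provinces &
    data$GENDER %in% genders &
    !is.na(data$AGE_T1) &
    data$AGE_T1 >= age_range[1] &
    data$AGE_T1 <= age_range[2]
  data[keep, , drop = FALSE]
}

# Hypothetical toy data with the same column names as the logbook
toy <- data.frame(
  Province = c("Groningen", "Friesland", "Drenthe", NA, "Groningen"),
  GENDER   = c(1, 2, 2, 1, 2),
  AGE_T1   = c(34, 51, 47, 62, 29)
)

filtered <- filter_lifelines(toy,
                             provinces = c("Groningen", "Drenthe"),
                             genders   = 2,
                             age_range = c(30, 60))
filtered
```

Keeping the filter in one function means every plot in the dashboard can use the same filtered dataframe, so the static and interactive plots always show the same selection.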
At first I thought these categories would be interesting. Then I looked at the different graphs in the analysis above, which showed me that it is very important to have a good way to visualize them. That said, some things also turned out to be less interesting, such as BMI, weight and height, because almost every dataset about provinces already shows these. Smoking, depression, burnout and sum of alcohol could be interesting, though. They can show, for example, which province is best for the least stress in the context of burnout.
The focus then shifted to something more specialised: an interest arose in DBP and in the factors that could lead to a heightened DBP. The reason for this focus is that a high DBP can go unnoticed by people who don't pay attention to, or are unaware of, the symptoms. A high DBP can cause a clogged artery, which can be blocked completely by a blood clot. This can cause a variety of issues, some of them deadly, like a heart attack. A blood vessel can also tear, causing other issues, including an aneurysm. Making people aware of this therefore felt like a good goal. All graphs in the app will have a direct or indirect relation to DBP.